Abstract

The present study examines the influence of judges' item-related 
knowledge on setting standards for competency tests. Seventeen 
judges from different professions took a 122-item teacher 
certification test in economics while setting competency 
standards for the test using the Angoff procedure. Judges tended 
to set higher standards for items they got right and lower
standards for items they had trouble with. Interjudge and
intrajudge consistency were higher for items all judges got right
than for items some judges got wrong. Procedures for making
judges' test-related knowledge and experience uniform are discussed.
Judge Competency 3
Does a standard reflect minimal competency 
of examinees or judge competency? 

In the past four decades, numerous procedures have been 
introduced and refined to establish performance standards on 
criterion-referenced achievement tests (Jaeger, 1989; Cizek,
1993). All of these procedures are judgmental and arbitrary
(Jaeger, 1976, 1989; Glass, 1978). They entail, in varying ways, 
judges' perceptions of how minimally competent examinees would 
perform on each item of the test. Judgmental errors arise when 
judges differ in their conceptualizations of minimal competency 
and, within judges, when such conceptualizations are not stably 
maintained across items. The motivation behind the four decades 
of experimenting with different standard setting methods is to 
reduce these errors or to maximize intrajudge and interjudge 
consistency in reaching judgements. 

What are the possible causes of judgmental inconsistencies 
both within and across judges? Plake, Melican, and Mills (1991) 
classified the potential causal factors into three categories in 
relation to judge backgrounds, items and their contexts, and 
standard-setting processes. Among the judge-related factors, 
judges' specialty and professional skills are suspected to 
influence their item ratings during standard setting (Plake et 
al., 1991). In many content areas, the domain of knowledge is so 
broad that it is unrealistic to expect the judges to know 
everything (Norcini, Shea, & Kanya, 1988) on the test even though 
they are considered experts. The fact that judges are often 
deliberately selected to represent different professional
experiences (Jaeger, 1991) makes it more difficult to assume that
their domain knowledge in relation to each individual item on a
test is a constant rather than a variable. Empirical findings of
markedly different standards derived by judges of different
professions (e.g., Jaeger, Cole, Irwin, & Pratto, 1980, cited
in Jaeger, 1989; Roth, 1987) may be explained by the judges'
different training and vocational focuses regarding a broadly
defined domain of knowledge. Another empirical finding is that
judges have different perceptions about minimal competencies (van
der Linden, 1982; Plake et al., 1991). It is logical to suspect
that judges' different professional focuses influence their 
perceptions of minimal competency in relation to an item. To 
what extent, then, does a competency standard derived for 
minimally competent examinees reflect the strengths and 
weaknesses of the judges with respect to the content domain of 
competency? 

To date, only one empirical study has attempted to
investigate this question. Norcini et al. (1988) compared three
cardiologists with three pulmonologists in their ratings of items
representing these two specialty areas. There was no
statistically significant difference in ratings between the two
groups of three specialty judges. These results, however, are
inconclusive for two reasons. First, the independent variable,
specialty expertise, was not operationally defined; in other
words, there was no objective evaluation of judges' item-related
expertise in each content area. Second, the vague distinction in
expertise was further muddled by the fact that all six judges
were involved in writing and reviewing the items being rated. As
the authors admitted, "This experience may have made them
'experts' in the narrow domain of the questions on the
examination and mitigated the effect of specialization" (p. 60).
Other researchers have echoed similar criticism (e.g., Plake et
al., 1991).

In the present study, item-related expertise of the judges
is operationally defined by having the judges take the test for
which they are to provide competency standards. It is
hypothesized that (1) judges will set a higher standard for items 
they answer correctly than for items they answer incorrectly, and 
(2) intrajudge and interjudge consistency will both be higher 
when all of the judges answer all of the items correctly than 
when some of the judges answer some of the items incorrectly. 
Interjudge and Intrajudge Consistency 

Interjudge consistency refers to the degree to which 
standards derived by different judges agree with each other. 
Intrajudge consistency (van der Linden, 1982) refers to the degree
to which an individual judge's estimate of item difficulty is
consistent among items. It is usually evaluated by comparing a
judge's estimate of item difficulty with an empirical item
difficulty (see Footnote 1), both of which are based on minimally
competent examinees. Intrajudge consistency can also be viewed as
internal consistency reliability of judge-estimated item difficulties
(Friedman & Ho, 1990). Reflecting Friedman and Ho's definition
of intrajudge consistency and the definition of interjudge
consistency, Brennan and Lockwood (1980) used generalizability
theory to estimate judgment errors both within and across judges
associated with the Angoff and Nedelsky procedures. The present
study uses Brennan and Lockwood's approach and examines
intrajudge and interjudge consistency viewed from the perspective 
of generalizability theory. The following discusses interjudge 
and intrajudge consistency within generalizability theory. 

$X_{ji}$ denotes the score given by judge $j$ to item $i$, drawn from
the population of judges and the universe of items. The expected
value of a judge's observed score is $\mu_j \equiv E_i X_{ji}$. The
sample estimate is $\bar{X}_j$. The expected value of an item is
$\mu_i \equiv E_j X_{ji}$. The corresponding sample estimate is
$\bar{X}_i$. The expected value over both judges and items is
$\mu \equiv E_j E_i X_{ji}$. The sample estimate is $\bar{X}$, or the
cutting score.
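
The three sample estimates defined above can be illustrated with a minimal sketch. The ratings matrix and the function name `sample_estimates` are hypothetical and are not taken from the study; rows stand for judges and columns for items.

```python
def sample_estimates(ratings):
    """Return judge means, item means, and the grand mean (cutting score).

    ratings[j][i] is the Angoff rating given by judge j to item i.
    """
    n_j = len(ratings)        # number of judges
    n_i = len(ratings[0])     # number of items
    judge_means = [sum(row) / n_i for row in ratings]
    item_means = [sum(ratings[j][i] for j in range(n_j)) / n_j
                  for i in range(n_i)]
    grand_mean = sum(judge_means) / n_j
    return judge_means, item_means, grand_mean


# Hypothetical ratings: 2 judges by 3 items (illustration only).
toy_ratings = [
    [0.6, 0.8, 0.4],
    [0.5, 0.7, 0.3],
]
judge_means, item_means, cut_score = sample_estimates(toy_ratings)
print(judge_means, item_means, cut_score)
```

Here the grand mean is expressed on the proportion scale; multiplying it by the number of items would give a cut score in raw-score units.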

$X_{ji}$ can be expressed in terms of the following equation:

$$X_{ji} = \mu + \tilde{\mu}_j + \tilde{\mu}_i + \tilde{\mu}_{ji}$$

where $\mu$ is the grand mean,

$\tilde{\mu}_j = \mu_j - \mu$ is the judge effect,

$\tilde{\mu}_i = \mu_i - \mu$ is the item effect,

$\tilde{\mu}_{ji} = X_{ji} - \mu_j - \mu_i + \mu$ is the residual effect.

For each of the three score effects there is an associated
variance component. They are:

$$\sigma^2(j) = E_j(\mu_j - \mu)^2$$

$$\sigma^2(i) = E_i(\mu_i - \mu)^2$$

$$\sigma^2(ji) = E_j E_i (X_{ji} - \mu_j - \mu_i + \mu)^2$$
The three variance components are estimated by equating them
to their observed mean squares in ANOVA:

$$\hat{\sigma}^2(j) = [MS(j) - MS(ji)] / n_i$$

$$\hat{\sigma}^2(i) = [MS(i) - MS(ji)] / n_j$$

$$\hat{\sigma}^2(ji) = MS(ji)$$

Adding up these estimates of variance components gives the
estimate for the expected observed score variance:

$$\hat{\sigma}^2(X_{ji}) = \hat{\sigma}^2(j) + \hat{\sigma}^2(i) + \hat{\sigma}^2(ji) \qquad (1)$$
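
As a worked check of the estimation step, the sketch below applies the mean-square formulas to the total-sample mean squares reported in Table 1 (17 judges, 122 items). The function name `variance_components` is hypothetical; only the mean squares and sample sizes come from the paper.

```python
def variance_components(ms_j, ms_i, ms_ji, n_j, n_i):
    """Estimate G-study variance components from ANOVA mean squares."""
    var_j = (ms_j - ms_ji) / n_i   # judge component: divided by number of items
    var_i = (ms_i - ms_ji) / n_j   # item component: divided by number of judges
    var_ji = ms_ji                 # residual (judge-by-item) component
    return var_j, var_i, var_ji


# Total-sample mean squares from Table 1.
var_j, var_i, var_ji = variance_components(
    ms_j=15.300, ms_i=4.480, ms_ji=0.306, n_j=17, n_i=122
)
total_variance = var_j + var_i + var_ji   # Equation (1)
print(round(var_j, 4), round(var_i, 4), round(var_ji, 4))
```

The results agree with the Table 1 estimates (.12287, .24541, .30653) to about three decimal places; the small differences presumably reflect rounding of the published mean squares.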

These variance components are associated with a single
judge's score on a single item ($X_{ji}$). In a standard setting
situation, a sample of $n'_j$ judges and $n'_i$ items is used to
estimate $\bar{X}$, the cutting score. By the central limit theorem, the
variance associated with $\bar{X}$ is:

$$\hat{\sigma}^2(\bar{X}) = \hat{\sigma}^2(j)/n'_j + \hat{\sigma}^2(i)/n'_i + \hat{\sigma}^2(ji)/(n'_j n'_i) \qquad (2)$$

$\hat{\sigma}^2(\bar{X})$ consists of two components:

$$\hat{\sigma}^2(\bar{X}_j) = \hat{\sigma}^2(i)/n'_i + \hat{\sigma}^2(ji)/(n'_j n'_i) \qquad (3)$$

$$\hat{\sigma}^2(\bar{X}_i) = \hat{\sigma}^2(j)/n'_j + \hat{\sigma}^2(ji)/(n'_j n'_i) \qquad (4)$$
Equations (3) and (4) represent intrajudge and interjudge
inconsistencies when $n'_j$ judges and $n'_i$ items are used to
estimate the standard, $\mu$. If some items are more difficult than
others, the selection of items will influence the judgement for a
minimally competent examinee's absolute level of performance.
Thus, $\hat{\sigma}^2(i)/n'_i$ is considered intrajudge inconsistency
since it has a direct impact on the expected value of a judge, $\mu_j$.
$\hat{\sigma}^2(j)/n'_j$ represents interjudge inconsistency because it
influences the expected value of an item over judges, $\mu_i$. It
shows that, if judges have different perceptions of minimal
competency and/or item difficulties, the selection of judges will
change the item difficulty. Finally, $\hat{\sigma}^2(ji)/(n'_j n'_i)$
contributes to both intrajudge and interjudge inconsistency. Part of
the judge-item interaction indicates that differences in leniency or
stringency among judges are registered differently on different
items. In other words, judges fail to maintain their standards
across items. With a single observation for each judge-item
combination, the last interpretation is, however, confounded with
other unexplainable effects.
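
To make Equations (2) through (4) concrete, the sketch below plugs the Table 1 variance components for the two 46-item subsamples into them, assuming that $n'_j = 17$ and $n'_i = 46$ match the observed sample sizes. The function name is hypothetical, and the resulting values are computed here for illustration; they are not copied from Table 2 and may differ from it.

```python
def cut_score_error_variances(var_j, var_i, var_ji, n_judges, n_items):
    """Return (intrajudge, interjudge, total) error variances.

    Implements Equations (3), (4), and (2) respectively.
    """
    interaction = var_ji / (n_judges * n_items)
    intrajudge = var_i / n_items + interaction                 # Equation (3)
    interjudge = var_j / n_judges + interaction                # Equation (4)
    total = var_j / n_judges + var_i / n_items + interaction   # Equation (2)
    return intrajudge, interjudge, total


# Variance components from Table 1; n' values are an assumption (17 judges, 46 items).
homogeneous = cut_score_error_variances(0.08822, 0.10675, 0.27108, 17, 46)
heterogeneous = cut_score_error_variances(0.14376, 0.10872, 0.32717, 17, 46)

for label, result in (("homogeneous", homogeneous), ("heterogeneous", heterogeneous)):
    intra, inter, total = result
    print(f"{label}: intrajudge={intra:.5f} interjudge={inter:.5f} total={total:.5f}")
```

Under these assumptions the interjudge term is noticeably larger for the heterogeneous knowledge sample, while the intrajudge term differs only slightly between the two samples.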

Method and Results 
The Test, Judges, and Standard-Setting Procedures 

The Florida Teacher Certification Examination in Economics 
was used to examine the influence of judges' item competency. 
The test contained 122 four-choice items. Seventeen judges were
selected from the state to set competency standards for this
test. They consisted of certified high school chemistry teachers,
education professors, and district supervisors. The teachers had
varying years of classroom experience. A modified Angoff (1971) 
procedure was used. Judges were first instructed about the 
Angoff procedure. They were then administered the 122-item test. 
While taking the test, they estimated item difficulty for 
minimally competent examinees. They were then given their own 
test scores and Angoff scores, means and frequency distributions 
of the panel's test scores and Angoff scores, and the mean and 
frequency distribution of a sample of examinees who took 
the test. With these information packets, they engaged in
subsequent "Closure with Consensus" discussions. With this 
technique, the panel was divided into smaller groups to discuss 
the material and reach consensus on the cut-off score. Having 
reached consensus within groups, each group sent an emissary to 
another group to form new groups to continue with the 
deliberation. This emissary process was repeated until consensus 
was reached among all judges regarding the passing score. Data 
reported in this study consisted of the individual judges' 
initial test scores and Angoff scores before the open group 
discussion. 
G-Study 

A random-effects $j \times i$ crossed-design ANOVA was conducted
within the whole sample and two subsamples. The whole sample was 
an Angoff score matrix of 122 items by 17 judges. The two 
subsamples had Angoff scores from the same 17 judges on a subset 
of 46 items. In one subsample, the 46 items were ones that all 
17 judges answered correctly when taking the test. This 
subsample will be referred to as the "homogeneous knowledge" 
sample. The other subsample had a different set of 46 items
where each of the 17 judges missed at least 5 items when taking
the test. This subsample will be called the "heterogeneous
knowledge" sample. Variance components and intrajudge and
interjudge inconsistencies were compared among these three 
samples. 
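
The mean squares that feed such a G-study can be obtained from a judges-by-items score matrix with one observation per cell. The following is a minimal sketch of that computation, not the software used in the study; the function name and the toy matrix are hypothetical.

```python
def two_way_mean_squares(scores):
    """Mean squares for a crossed judge-by-item design, one score per cell.

    scores[j][i] is the rating given by judge j to item i.
    Returns (MS(j), MS(i), MS(ji)).
    """
    n_j = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_j * n_i)
    judge_means = [sum(row) / n_i for row in scores]
    item_means = [sum(scores[j][i] for j in range(n_j)) / n_j
                  for i in range(n_i)]

    ss_j = n_i * sum((m - grand) ** 2 for m in judge_means)
    ss_i = n_j * sum((m - grand) ** 2 for m in item_means)
    ss_ji = sum(
        (scores[j][i] - judge_means[j] - item_means[i] + grand) ** 2
        for j in range(n_j)
        for i in range(n_i)
    )

    ms_j = ss_j / (n_j - 1)
    ms_i = ss_i / (n_i - 1)
    ms_ji = ss_ji / ((n_j - 1) * (n_i - 1))
    return ms_j, ms_i, ms_ji


# Hypothetical ratings: 3 judges by 4 items (illustration only).
toy_scores = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.7, 0.4, 0.6],
    [0.7, 0.8, 0.5, 0.9],
]
print(two_way_mean_squares(toy_scores))
```

With 17 judges and 122 items, the residual degrees of freedom are 16 x 121 = 1936, which matches the total-sample entry in Table 1.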
Insert Tables 1 and 2 about here



The G-study results from the three samples are reported in
Table 1 and intrajudge and interjudge inconsistencies are
reported in Table 2. Variance components, $\hat{\sigma}^2(j)$ and
$\hat{\sigma}^2(ji)$, estimated from the heterogeneous knowledge sample
were much larger than those from the homogeneous knowledge sample.
$\hat{\sigma}^2(i)$ was similar across the two samples. Correspondingly,
judgements were more consistent across judges (interjudge consistency)
when they knew the answers to all the items on the test. Interjudge
consistency was much worse for the items to which judges did not
know all the answers. Intrajudge consistency was similar across
the two samples although it was still higher for the items judges
knew the answers to than those items some of the judges did not
know the answers to. These findings supported the hypothesis
that lack of content knowledge increases errors in
standard-setting.
T-Tests 

T-tests were conducted within an individual judge to test
the first hypothesis that a judge's standard was higher for
items he/she knew the answers to than for items he/she did not
know. The t-test compared a judge's average Angoff score, the
standard, derived from the items she/he got right when taking the
test against the standard based on the items she/he got wrong.
For one judge who did not miss any items, such a comparison was
not possible. Thus there were 16 t-tests. The results are
reported in Table 3.



Insert Table 3 here 



As can be seen from Table 3, for all 16 judges, their Angoff 
ratings were much higher for items they knew than for items they 
did not know. Fifteen out of the 16 t-tests were significant,
p < .05. Apparently, when a judge knew the answer, the judge
expected a larger proportion of minimally competent examinees to 
get the item right than when the judge himself or herself had 
trouble with the item. 
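
The within-judge comparison can be sketched as an independent two-sample t-test on one judge's Angoff ratings, split by whether the judge answered each item correctly. The paper does not say whether a pooled-variance or unequal-variance test was used, so the pooled form below is an assumption; the ratings, the correctness flags, and the function name are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(right, wrong):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n1, n2 = len(right), len(wrong)
    pooled_var = ((n1 - 1) * variance(right) + (n2 - 1) * variance(wrong)) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(right) - mean(wrong)) / standard_error, n1 + n2 - 2


# Hypothetical data for one judge: Angoff rating per item, and whether
# the judge answered that item correctly when taking the test.
ratings = [0.8, 0.7, 0.9, 0.6, 0.7, 0.5, 0.4, 0.6]
answered_correctly = [True, True, True, True, True, False, False, False]

items_right = [r for r, ok in zip(ratings, answered_correctly) if ok]
items_wrong = [r for r, ok in zip(ratings, answered_correctly) if not ok]

t_stat, df = pooled_t(items_right, items_wrong)
print(f"t({df}) = {t_stat:.2f}")
```

A positive t statistic indicates higher mean ratings for the items the judge answered correctly, which is the direction reported for all 16 judges in Table 3.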

Discussion 

The results from this study are straightforward. Judges'
domain knowledge related to the items on a test affects
standard-setting both in terms of the mean, or the standard, and the
variance, or errors surrounding the standard. As a matter of common sense,
judges tend to set relatively higher standards for items they 
know and lower standards for items they do not know. The problem 
is that the standard thus derived reflects not the minimal 
competency of the examinees as it should, but the competency of 
the judges. 

Judge competency has similar influences on the consistency
of the standard. Interjudge inconsistency arises as a result of
the heterogeneous competency background of the judges. When some
of the judges do not know some of the items, there is more
discrepancy in the standards derived. On the other hand,
judgement is more consistent for items to which all judges know
the answers.

One implication of this study is that more emphasis should 
be placed on training judges prior to standard setting. When 
judges come from different professions and experiences, it is 
only natural that they have different focuses on the knowledge 
domain of which the competency test is a sample. Consequently, 
they may not be uniformly familiar with every item on the test.
Item-related training, including having the judges take the test,
will make their experience and expertise more uniform so as to reduce
interjudge and intrajudge inconsistency.

Logically, however, those who initially did not know an item
and learned it through training could still be more lenient when
judging that item than other items they knew initially. On the
other hand, judges who did better on the test initially may be
more stringent in rendering standards than those who did worse
despite training. Thus, item-related training should also be
accompanied by specific instructions to guard against setting
the "judge competency standards" found in this study. Having the
judges take the test and providing them with the test information
will help in this regard. For example, knowing that 90% of the
panel answered the item correctly, a judge who failed the item is
likely to change his/her otherwise low estimate of item
difficulty, which reflects the judge's lack of item competency.
The results of judges' initial tests can also be used to screen
judges by eliminating the outliers.

Findings from the present study also provide clues to the 
lack of equitability among different standard-setting methods 
(Andrew & Hecht, 1976; Skakun & Kling, 1980; Koffler, 1980;
Brennan & Lockwood, 1980; Poggio, Glasnapp, & Eros, 1981; Mills, 
1983; Cross, Impara, Frary, & Jaeger, 1984; Jaeger, 1989). Among 
the different procedures, the Nedelsky method was often found to 
produce lower standards (Andrew & Hecht, 1976; Shepard, 1980; 
Skakun & Kling, 1980; Brennan & Lockwood, 1980; Poggio, Glasnapp, 
& Eros, 1981; Cross, Impara, Frary, & Jaeger, 1984). In light of 
the present study, the lower Nedelsky standard may be due to the
fact that judges' own difficulty with items is more directly
tested with the Nedelsky procedure, where the judges have to go
through all the alternative answers to eliminate the wrong ones.
A judge has to evaluate the similarities and differences among 
the response options (Smith and Smith, 1988) to determine the 
probability of eliminating the wrong answers. Such a process 
taxes a judge's knowledge much more frequently than does 
determining the difficulty of the item as a whole in the Angoff 
and other procedures. It is likely that a judge who is fairly 
confident of the answer to the item becomes more doubtful of 
his/her item-related knowledge when going through each 
alternative in the Nedelsky method. According to the findings of 
the present study, the judge's doubt about an item will be 
reflected in a lower Nedelsky standard. 

Quasi-experimental studies can be conducted to further test
the influence of judges' domain knowledge. Specifically, the
Nedelsky method can be compared with the Angoff method for judges
expected to know the items, e.g., judges who were involved in
developing the items, and for judges who are not expected to know
all the answers on the test. We anticipate a much smaller
difference between the Nedelsky and Angoff procedures for the
former group than for the latter.

It is important to identify the negative impact of judge 
knowledge on standard-setting. To a certain degree, subjectively 
derived standards of minimal competency are expected to reflect 
the competency of the people who derive them. On the other hand, 
it is unrealistic to expect judges to be uniformly competent with 
respect to every item on the test. Further research should seek 
a better understanding of the "judge competency standard" 
phenomenon and find ways to minimize it. 
Footnote

1 The empirical item difficulty can be obtained in three ways.
The most straightforward way is to determine the proportion of
people getting the item right from a sample of minimally
competent examinees (Plake et al., 1991). When such a sample is
unavailable, as is often the case, it can be derived from a certain
part of the distribution when the test is administered to a total
group of examinees (Plake et al., 1991). Finally, it can be
mathematically estimated through the application of an IRT model
(van der Linden, 1982; Friedman & Ho, 1990; Plake et al., 1991).
Table 1

Variance estimates from G-studies

Source          df       MS        Estimated variance component

Total Sample
Item (i)        121      4.480     .24541
Judge (j)       16       15.300    .12287
ji              1936     0.306     .30653

Homogeneous Knowledge Sample
Item (i)        45       2.086     .10675
Judge (j)       16       4.329     .08822
ji              720      0.271     .27108

Heterogeneous Knowledge Sample
Item (i)        45       2.175     .10872
Judge (j)       16       6.941     .14376
ji              720      0.327     .32717
Table 3

T-Test Results

                   Angoff Scores
          Items Right       Items Wrong
Judge     Xj      n'i       Xj      n'i       T-Test

1         .69     86        .53     36        3.08**
2         .58     86        .43     36        2.87**
3         .54     84        .36     38        3.17***
4         .72     107       .60     15        2.42*
5         .79     94        .61     28        5.21***
6         .63     93        .54     29        2.37*
7         .51     86        .40     36        1.97*
8         .71     102       .44     20        5.62***
9         .72     87        .53     35        3.33**
10        .75     91        .59     31        3.62**
11        .61     90        .37     32        5.13***
12        .57     90        .42     32        4.02***
13        .47     68        .24     54        4.41***
14        .40     102       .33     20        1.44
15        .66     107       .48     15        3.87***
16        .70     109       .37     13        5.34***

Note. Xj is a judge's mean Angoff rating based on the
n'i items that he/she got right (Items Right)
or wrong (Items Wrong) when taking the test.

*p < .05, two-tailed. **p < .01, two-tailed. ***p < .001, two-tailed.